Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] Truncate search results to prevent OOM on jaeger-query. #1202

Closed

Conversation

annanay25
Copy link
Member

Resolves #1051.

Changes:

  • Added upper bound on number of traces to fetch from a Cassandra backend.
  • In addition to upper bound on traces, added upper bound on number of spans to fetch from an ES backend.

Question: Do we need a flag to mark the result as truncated on the UI?

Signed-off-by: Annanay [email protected]

@annanay25 annanay25 changed the title [Storage] Truncate search results to prevent OOM on jaeger-query. [query] Truncate search results to prevent OOM on jaeger-query. Nov 20, 2018
Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

@@ -237,7 +237,10 @@ func (s *SpanReader) FindTraces(ctx context.Context, traceQuery *spanstore.Trace
}
if traceQuery.NumTraces == 0 {
traceQuery.NumTraces = defaultNumTraces
} else if traceQuery.NumTraces > 10000 {
traceQuery.NumTraces = 10000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this hardcoded upper limit? It should be user-configurable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Also -

Going through the current traceQueryParameters()
https://github.com/jaegertracing/jaeger/blob/master/cmd/query/app/query_parser.go#L139 - NumTraces seems to be defined as defaultQueryLimit but in its implementation it's used when NumTraces is not passed explicitly - https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/cassandra/spanstore/reader.go#L238

Should this be renamed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ This is fine. My bad, confused the two.

@annanay25
Copy link
Member Author

So I'm a little confused. Each query can already be parameterised with a limit on number of traces - https://github.com/jaegertracing/jaeger/blob/master/cmd/query/app/query_parser.go#L139 (if not specified then a default is used) - doesn't this resolve #1051? If no, then why not?

@yurishkuro
Copy link
Member

yurishkuro commented Nov 28, 2018

The original ticket was about individual traces with large number of spans (we've seen traces with tens of millions of spans).

@annanay25
Copy link
Member Author

@yurishkuro OK.

In that case, I will drop the following changes to plugin/storage/cassandra/spanstore/reader.go

	if traceQuery.NumTraces == 0 {
		traceQuery.NumTraces = defaultNumTraces
	} else if traceQuery.NumTraces > 10000 {
		traceQuery.NumTraces = 10000
        }

But will retain the following TerminateAfter(defaultNumSpans) -

searchRequests[i] = elastic.NewSearchRequest().IgnoreUnavailable(true).Type(spanType).Source(elastic.NewSearchSource().Query(query).Size(defaultDocCount).TerminateAfter(defaultNumSpans).Sort("startTime", true).SearchAfter(nextTime))

Should defaultNumSpans be user configurable? This would a change in the parameter list of the function multiRead().

@annanay25 annanay25 force-pushed the add-numtrace-upperbound branch from 29a998d to f67231c Compare December 9, 2018 07:15
@codecov
Copy link

codecov bot commented Dec 9, 2018

Codecov Report

Merging #1202 into master will decrease coverage by 0.06%.
The diff coverage is 58.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1202      +/-   ##
==========================================
- Coverage     100%   99.93%   -0.07%     
==========================================
  Files         160      160              
  Lines        7181     7189       +8     
==========================================
+ Hits         7181     7184       +3     
- Misses          0        4       +4     
- Partials        0        1       +1
Impacted Files Coverage Δ
plugin/storage/es/spanstore/reader.go 100% <100%> (ø) ⬆️
cmd/query/app/query_parser.go 94.79% <37.5%> (-5.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eed786f...a509e7d. Read the comment docs.

@annanay25 annanay25 force-pushed the add-numtrace-upperbound branch from f67231c to e73dcd4 Compare December 9, 2018 07:16
@annanay25
Copy link
Member Author

PR updated.

Need to figure out a way parameterise GetTrace() (would require changes to the spanstore.Reader interface!) with numSpans if a user wants to force retrieval of all spans.

@annanay25
Copy link
Member Author

@yurishkuro could you please re-run Travis on this?

@pavolloffay
Copy link
Member

There is a git conflict.It needs to be rebased.

@annanay25 annanay25 force-pushed the add-numtrace-upperbound branch from e73dcd4 to a509e7d Compare December 26, 2018 17:48
@annanay25
Copy link
Member Author

@pavolloffay I've rebased and removed conflicts.

Also, if we want to allow users to force retrieval of more than 10k spans for a trace, we'd have to make changes to the spanstore.Reader interface. What's our take on that?

@@ -30,12 +30,14 @@ import (

const (
defaultQueryLimit = 100
defaultSpansLimit = 10000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was no default previously, we should not introduce one. I.e. default should be 0, which is == unlimited.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the very last comment


operationParam = "operation"
tagParam = "tag"
tagsParam = "tags"
startTimeParam = "start"
limitParam = "limit"
limitSpansParam = "limitSpans"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/limitSpans/maxSpansPerTrace/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the very last comment

@@ -105,6 +107,16 @@ func (p *queryParser) parse(r *http.Request) (*traceQueryParameters, error) {
limit = int(limitParsed)
}

limitSpansParam := r.FormValue(limitSpansParam)
limitSpans := defaultSpansLimit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scope the string var to the if statement:

var maxSpansPerTrace int
if maxSpansParam := r.FormValue(maxSpansPerTraceParam); maxSpansParam != "" {
}

@@ -269,7 +269,7 @@ func (s *SpanReader) multiRead(ctx context.Context, traceIDs []string, startTime
if val, ok := searchAfterTime[traceID]; ok {
nextTime = val
}
searchRequests[i] = elastic.NewSearchRequest().IgnoreUnavailable(true).Type(spanType).Source(elastic.NewSearchSource().Query(query).Size(defaultDocCount).Sort("startTime", true).SearchAfter(nextTime))
searchRequests[i] = elastic.NewSearchRequest().IgnoreUnavailable(true).Type(spanType).Source(elastic.NewSearchSource().Query(query).Size(defaultDocCount).TerminateAfter(numSpans).Sort("startTime", true).SearchAfter(nextTime))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference between Size() and TerminateAfter()?

@@ -54,6 +54,7 @@ const (

defaultDocCount = 10000 // the default elasticsearch allowed limit
defaultNumTraces = 100
defaultNumSpans = 10000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has no effect currently since multi-read is already bound by defaultDocCount

@@ -51,4 +51,5 @@ type TraceQueryParameters struct {
DurationMin time.Duration
DurationMax time.Duration
NumTraces int
NumSpans int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, why does this need to be passed from the interface? I think the overall feature is intended to protect storage from OOM, so the parameterization should be done via a parameter when instantiating the storage, not at query time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yurishkuro, I think this particular issue was filed for OOM on the jaeger-query nodes, and we have a separate issue for handling overloading of the underlying storage - #960

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing it in the storage will address query-service OOM.

Copy link
Member Author

@annanay25 annanay25 Jan 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Will look into. Closing this PR.

So do we want to restrict number of spans (per trace) at ingestion? @yurishkuro

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at ingestion (which isn't possible anyway since spans arrive async). We want the limit to be applied at query time, similar to the changes you made to plugin/storage/es/spanstore/reader.go, but the value of the limit to be configurable as a parameter of the query-service, passed at Reader construction from CLI, not as a parameter of the HTTP request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry its taking so many iterations to get this one right. Understood now, sending in a new, clean PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants